Abstract

We propose a new data-centric synchronization framework for carrying out machine learning
(ML) tasks in a distributed environment. Our framework exploits the iterative nature of ML
algorithms and relaxes the application-agnostic bulk synchronous parallel (BSP) paradigm that
has previously been used for distributed machine learning. Data-centric synchronization
complements function-centric synchronization, which uses stale updates to increase the throughput
of distributed ML computations. Experiments to validate our framework suggest that we can
attain substantial improvement over BSP while guaranteeing sequential correctness of ML tasks.


1 Introduction and Related Work 

In an increasing number of application domains, ranging from speech and image recognition systems to online
advertising, both the size of data sets and the complexity of learning models continue to increase. It is now not
uncommon to train machine learning models that consist of over a billion parameters¹ and data points [9, 8].
For such large-scale machine learning models it becomes necessary to train and deploy them in a distributed
environment. In an ideal setting, the speed-up obtained in a distributed setting should be proportional to the
number of computation nodes available in the system. In practice, however, machine learning models often require
the computation between the nodes to be synchronized, resulting in a dramatic reduction in effective parallelism.

There are different forms of synchronization that can be architected in a distributed system. The most common 
form can be described as process synchronization. For example consider a shared memory system, where tasks 
(typically model parameters updates) are distributed between different threads (workers) but there is a common 
memory bank to which all workers read and write. Computation is often carried out in phases and at the end 
of each phase, all workers wait till the last worker has finished its task and has saved its computation in shared 
memory. However, previous studies have shown that task times across different workers often follow a skewed
distribution and that the overall time is bottlenecked by the worker that takes the longest to finish its task.
This is often termed the Straggler or the Last Reducer Problem 0H2]. Notice that while process synchronization
ensures that the output of the computation is sequentially correct, i.e., the same output is guaranteed to be obtained
as if it were executed on a single worker, it is completely agnostic of the nature of the specific task executed.

While process synchronization is problem independent, a new form of synchronization has emerged specifically
for machine learning tasks. We will refer to it as function synchronization. At the highest level of
abstraction, machine learning reduces to estimating model parameters with the objective that the model output will
closely align with observable (existing and future) data. However, depending upon the nature of the specific task,
the loss function used in the objective function can be different. Function synchronization relaxes the full process
synchronization barrier by allowing workers to operate in a controlled but asynchronous manner. For example,
workers are allowed to operate using old outputs of other workers as long as the old values are within a
function-specific bounded delay 0. The estimate of the delay allowed depends upon the nature of the loss function.
However, the general rule of thumb is that “smooth” loss functions can tolerate longer delays compared to their
non-smooth counterparts 0. A more radical approach has been proposed where workers are allowed to update
parameters in a completely asynchronous manner. When data is extremely sparse (which is a common occurrence
in many application settings) and the stochastic gradient algorithm is used (thus data access is random), the chance
of update conflicts between workers turns out to be extremely small. However, the completely asynchronous approach
comes with almost no theoretical guarantees HD. Nomad HD, on the other hand, is a non-locking distributed
protocol for matrix completion that leverages function semantics for ensuring serializability of concurrent updates.

¹ The terms “parameter” and “feature” are often used interchangeably.

However, there is another possible form of synchronization which has been largely ignored by the machine learning
community. We will refer to it as data-centric synchronization; it has its roots in database transaction systems.
A transaction in a database system is a set of database operations (typically read, write and update) which are
guaranteed to be executed in an atomic manner. In order to increase throughput, modern database systems
allow transactions to be executed in a concurrent fashion while guaranteeing sequential correctness. The logic
of concurrency in a database system does not depend upon the semantics of the high-level database query but on
the properties of read, write and update operations. Through the use of carefully designed data access protocols
a substantial amount of concurrency can be achieved in database systems. Recently, proposals have emerged to
leverage optimistic concurrency control (OCC) from database serializability theory for function synchronization
in the context of machine learning problems such as unsupervised clustering HQ|. However, we note that the
validation step of OCC in this proposal is used for detecting semantic violations due to data partitioning and not
synchronization violations due to concurrent data access.

In order to apply data-centric synchronization methods to machine learning, we have to focus on the typical
algorithm used to estimate model parameters rather than the task-specific functional form that describes the model.
It may come as a surprise that many machine learning tasks can be abstracted to carrying out an iterative operation
based on the template 0:

$$\theta_i[a+1] = f_i(\theta_1[a], \ldots, \theta_n[a]) \qquad \forall i = 1, \ldots, n \tag{1}$$

Here the $\theta_i$ are the model parameters, $a$ is the iteration number, and $f_i$ is the update function for
variable $\theta_i$, which subsumes the (immutable) data set. Notice that each $f_i$ is only used for updating
$\theta_i$. We can think of the parameters as data elements (thus the name parameter database) and of a transaction
as a single iteration which consists of updating all the parameters. For simplicity assume that each $\theta_i$ is
assigned to a unique worker. At first glance, the worker at iteration $a + 1$ then has to wait for all the parameter
values at iteration $a$ to be known, i.e., it has to wait for all other workers to finish their task. However, by
designing access protocols we will show how this assumption can be relaxed and transactions (iterations) can be
executed in a concurrent fashion while guaranteeing full sequential correctness. In the process we will show that
the existing database concurrency control protocols (like two-phase locking) do not apply in this setting and that
new protocols need to be designed to parallelize fixed point iterative computation and overcome the process
synchronization barrier.

We summarize our contributions as follows: 

1. A new form of data-centric synchronization is introduced to speed up machine learning tasks while
guaranteeing sequential correctness. New ML systems can be designed that combine functional and
data-centric synchronization, as the two are mutually independent.

2. We will show that traditional data level access protocols like two-phase locking are not strong enough to 
support iterative computation which is characteristic of machine learning. 

3. We will develop a new theory of data-centric synchronization specifically for fixed point iterative
computation with accompanying provable guarantees for sequential correctness.

4. Experiments on a prototype machine learning task (linear regression) will show that using our relaxed 
data access protocols we can obtain fifty to eighty percent speed-up compared to implementations that 
enforce process synchronization. 

The rest of the paper is organized as follows. In Section 2 we explain how algorithms for many ML tasks can be
abstracted as a fixed-point iteration computation. In Sections 3 and 4 we present the theoretical foundations of
data-centric synchronization and relate it to BSP. A simple protocol to implement our proposed framework is detailed
in Section 5. Experiments to test the validity of our approach are presented in Section 6. Section 7 contains an
extension of the data-centric model to incorporate bounded delay updates. We conclude in Section 8 with a summary.
The supplementary section contains all the proofs.

2 ML Abstraction and Scope 

Machine Learning problems are now being increasingly formulated as optimization problems which take a precise 
form described as 

$$\min_{\theta} h(\theta) = \sum_{j \in \text{data}} f_j(\theta) + g(\theta) \tag{2}$$

The function $f_j$ measures the discrepancy between the model ($\theta$) and the data, while $g(\theta)$ is a
regularizer term that prevents the model from overfitting the data and encourages certain forms of solutions
(e.g., sparse or spatially contiguous).

The optimization problem in Equation 2 rarely admits an analytical solution, and recourse is often taken to iterative
algorithms like gradient descent which follow the template of Equation 1. More specifically, the update at iteration
$a + 1$ is derived from iteration $a$ as

$$\theta[a+1] = \theta[a] - \eta \nabla h(\theta[a])$$

or, expressed in scalar form,

$$\theta_i[a+1] = \theta_i[a] - \eta \frac{\partial h(\theta[a])}{\partial \theta_i}$$

Note that the update of component $\theta_i$ depends upon the availability of all components of $\theta$ from the
previous iteration.
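The scalar update can be illustrated with a minimal sketch, assuming a least-squares data term, no regularizer ($g = 0$), and a tiny hypothetical data set. The names `grad_component` and `gd_step` are illustrative and do not come from the paper. The point to notice is that every component update reads the full parameter vector of the previous iteration.

```python
def grad_component(theta, data, i):
    """Partial derivative of h(theta) = sum_j (x_j . theta - y_j)^2 w.r.t. theta_i."""
    total = 0.0
    for x, y in data:
        residual = sum(xk * tk for xk, tk in zip(x, theta)) - y
        total += 2.0 * residual * x[i]
    return total


def gd_step(theta, data, eta):
    """One iteration a -> a+1: each component i reads the full theta[a]."""
    return [theta[i] - eta * grad_component(theta, data, i)
            for i in range(len(theta))]


# Hypothetical noise-free data generated by y = 2*x0 - 1*x1.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
theta = [0.0, 0.0]
for _ in range(200):
    theta = gd_step(theta, data, eta=0.1)
```

With this step size the iteration settles near the generating coefficients (2, -1). In a feature-partitioned setting, each call to `grad_component` for index `i` would run on the worker that owns $\theta_i$.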


2.1 Scope 

A large body of research (both in the ML and optimization communities) has tackled problems related to the
convergence of gradient descent and similar methods, setting the learning parameter $\eta$, the choice of the data
term $f$, and the regularizer $g$. Our contribution is orthogonal and we will assume that we are operating in a loss
function regime where these issues have been addressed.

Furthermore, the nature of ML solutions is such that the model parameter solution can admit a higher degree of
imprecision compared to other application domains. In fact, functional synchronization exploits this characteristic
of ML solutions. Data-centric synchronization, however, will guarantee sequential correctness, i.e., we will be able
to provably show that we can carry out certain types of concurrent (inter-iteration) updates in gradient descent
algorithms where the end result is exactly the same as if the algorithm had been executed in a sequential manner.
Extending data-centric synchronization to incorporate bounded delays is relatively straightforward and is briefly
explained in the paper.

3 Data-centric Synchronization 

The design space for parallelization can be broadly classified as follows: (i) Data Partitioning: Training data is
divided among multiple workers and each worker node is responsible for learning all the parameters based on its
chunk of training data; (ii) Feature Partitioning: A worker node is responsible for computing updates for a chunk
of the feature space based on the entire training data; and (iii) Data and Feature Partitioning: A worker node is
responsible for computing updates for a chunk of the feature space based on a chunk of training data. Our formal
development in this paper is restricted to the partitioning of the features, i.e., case (ii).

Database Management Systems (DBMSs) significantly simplify the application development process by providing
the transaction abstraction [6], which makes the issues of concurrency, synchronization, and failures transparent to
the developer. Underlying the transaction concept is a data-centric synchronization technique such as two-phase
locking 0 that synchronizes read and write accesses from concurrent transactions to ensure that the interleaved
execution of these transactions is correct (formally, referred to as being serializable 12). When considering ML
computations, a natural question is whether the transaction concept with two-phase locking from DBMSs can be
used for iterative computations where iterations are parallelized over multiple workers. A logical mapping would
be to view each iteration of a worker as a transaction unit, which would ensure that in each iteration the workers
are isolated from each other and are executed serially. This is not desirable, since the sequential semantics of an
ML computation requires that in each iteration the reads of all workers are executed before the writes of all
workers in the same iteration.

From the sequential semantics of an ML computation, we observe that the notion of a transaction is not an iteration
per worker; rather, it should be mapped to an iteration $a$ across all workers. Unfortunately, to our knowledge none
of the DBMSs support the notion of splitting a transaction across multiple nodes. Even if this notion were
supported, there is an additional constraint which requires that the only serialization order of transactions is the
one in which iterations are executed sequentially in strictly increasing order of the iteration number. To the best
of our knowledge, none of the general-purpose DBMSs support such meta-level synchronization of transactions.

4 Model DB Synchronization


In this section we develop the theoretical machinery which will underpin data synchronization. The key idea is 
to relax process synchronization by introducing separate read (RC) and write (WC) constraints. We will show 
that process synchronization implies RC and WC and that enforcing the two constraints guarantees sequential 
correctness while allowing for asynchronous execution. 

4.1 Data Model 

Given the enormous size of the model variable vector, and the fact that model variables are both read and written 
during the model computation, it seems natural to store the model variables as a database managed over multiple 
servers. 

Definition 1 ML Parameter Database. Given a machine learning model $M$ with parameters
$\Theta = \{\theta_1, \theta_2, \ldots, \theta_m\}$ and a database $D$, the parameter database is denoted as
$M(\Theta, D)$.

Note that in a parameter database $M(\Theta, D)$, the data set $D$ is immutable and the parameters are inferred from
the data using a machine learning algorithm. In a shared memory system we are interested in partitioning the
parameters of the machine learning model so that they can be independently managed and updated by workers.
We next give a formal definition for partitioning the parameter space into disjoint partitions that can be managed
independently.

Definition 2 ML Feature Partitions. A partition set $\Pi$ consists of $p$ partitions over the parameter database
$M(\Theta, D)$, denoted by

$$\Pi = \{\pi_1, \pi_2, \pi_3, \ldots, \pi_p\}$$

where each $\pi_i$ may consist of one or more parameters such that: (i)
$\pi_i = \{\theta_{i_1}, \theta_{i_2}, \theta_{i_3}, \ldots, \theta_{i_d}\}$; (ii)
$\forall i \neq j,\ \pi_i \cap \pi_j = \emptyset$; and (iii) $\Theta = \bigcup_{i=1}^{p} \pi_i$.

Note that partitioning is a logical concept, in that the partitions may be stored in a single database server or, in
the extreme case, may be distributed over $p$ database servers. There will be a bijection between workers and
partitions, and each worker $i$ will be responsible for updating the partition $\pi_i$ associated with it.

Definition 3 Database Access Model. An iteration $a$ at worker $i$ consists of reads denoted as $r_i[\pi_j][a]$ and
writes $w_i[\pi_i][a]$. The read and write accesses within each iteration $a$ at worker $i$ are such that
$\forall j:\ r_i[\pi_j][a] < w_i[\pi_i][a]$, where $<$ denotes the happens-before relation. For a system consisting
of a single worker we will suppress subscripts for the worker and denote read and write accesses as $r[\pi_j][a]$
and $w[\pi_j][a]$ respectively.
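The partition conditions of Definition 2 can be checked mechanically. The sketch below is an illustration only: it assumes parameters are identified by integer indices and uses a contiguous split, which is one of many valid choices; `make_partitions` and `is_valid_partition_set` are hypothetical helper names.

```python
def make_partitions(num_params, p):
    """Split parameter indices 0..num_params-1 into p contiguous, disjoint partitions."""
    base, extra = divmod(num_params, p)
    partitions, start = [], 0
    for i in range(p):
        size = base + (1 if i < extra else 0)
        partitions.append(set(range(start, start + size)))
        start += size
    return partitions


def is_valid_partition_set(partitions, num_params):
    """Check that partitions are nonempty, pairwise disjoint, and cover all parameters."""
    seen = set()
    for part in partitions:
        if not part or seen & part:
            return False
        seen |= part
    return seen == set(range(num_params))


partitions = make_partitions(num_params=10, p=3)
```

Worker `i` would then own `partitions[i]`, which realizes the bijection between workers and partitions.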

4.2 Sequential ML Execution 

In this section, we illustrate the sequential computation in Algorithm 1, which will serve as the foundation for the
correctness semantics of parallelized ML computations.


Algorithm 1 Sequential Computation

procedure ML Computation
    initialize iteration $a$
    while not converged do
        $r[\pi_1][a], \ldots, r[\pi_p][a]$
        fixed point computation at iteration $a$ using data set $D$
        $w[\pi_1][a], \ldots, w[\pi_p][a]$
        increment $a$


Using the read and write model defined above we can now define the notion of correct executions from an ML 
computation point-of-view. We use a single-threaded sequential or synchronous execution as a ground truth for 
correctness in an ML system. 

Definition 4 Sequential ML Computation. A sequential ML execution on $\Pi$ is an execution that is single-threaded
(i.e., has no parallelism across multiple threads) and sequential (i.e., each iteration is completed before the next one
starts). Formally, in a sequential ML execution: (i) No operation of an iteration interleaves with the operations of
another iteration; (ii) Within an iteration, all read operations precede any write operation; and (iii) An operation
of iteration $a + 1$ cannot appear until all operations of iteration $a$ have completed.

The correctness of sequential execution is based on the observation that if iterations are executed sequentially and 
within each iteration all model parameters are read before being updated then the corresponding execution will 
preserve the semantics of an underlying ML computation. 

$SEQ_1 = r[\pi_1][1]\ r[\pi_2][1]\ w[\pi_1][1]\ w[\pi_2][1]\ r[\pi_1][2]\ r[\pi_2][2]\ w[\pi_1][2]\ w[\pi_2][2]$

$SEQ_2 = r[\pi_1][1]\ r[\pi_2][1]\ w[\pi_2][1]\ w[\pi_1][1]\ r[\pi_2][2]\ r[\pi_1][2]\ w[\pi_1][2]\ w[\pi_2][2]$

Figure 1: Sequential executions over $\Pi = \{\pi_1, \pi_2\}$.

Figure 1 illustrates executions $SEQ_1$ and $SEQ_2$ on a model DB consisting of two partitions with two iterations.
Both executions are sequential ML computations. We note that within an iteration, the read operations can be
permuted among themselves, and likewise the write operations, and the execution will still be deemed correct.
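The conditions of Definition 4 can be tested on a recorded execution. The following is a minimal sketch, assuming an execution is given as a complete list of `(kind, partition, iteration)` tuples; this encoding and the function name are illustrative assumptions, not notation from the paper.

```python
def is_sequential_ml_execution(ops):
    """ops: complete list of (kind, partition, iteration), kind in {'r', 'w'}."""
    current_iter = None
    seen_write = False
    for kind, _partition, iteration in ops:
        if current_iter is None or iteration > current_iter:
            # A new iteration starts; no write of it has been seen yet.
            current_iter, seen_write = iteration, False
        elif iteration < current_iter:
            return False  # an earlier iteration interleaves with a later one
        if kind == "w":
            seen_write = True
        elif seen_write:
            return False  # a read follows a write inside the same iteration
    return True


SEQ1 = [("r", 1, 1), ("r", 2, 1), ("w", 1, 1), ("w", 2, 1),
        ("r", 1, 2), ("r", 2, 2), ("w", 1, 2), ("w", 2, 2)]
SEQ2 = [("r", 1, 1), ("r", 2, 1), ("w", 2, 1), ("w", 1, 1),
        ("r", 2, 2), ("r", 1, 2), ("w", 1, 2), ("w", 2, 2)]
# Not sequential: partition 2 is read after partition 1 was already written.
BAD = [("r", 1, 1), ("w", 1, 1), ("r", 2, 1), ("w", 2, 1)]
```

Both executions of Figure 1 pass the check, while `BAD` is rejected because a read of iteration 1 follows a write of the same iteration.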


4.3 Process-centric Synchronization of Parallel ML Execution 

A sequential ML computation can be parallelized by assigning the partitions in $\Pi$ to different worker nodes that
execute in parallel and are each responsible for updating the features in their partition. The overall coordination
between the workers is carried out by the master node, which initiates a worker $i$ for each partition $\pi_i$. Each
worker proceeds (until convergence) in parallel, in each iteration reading the current state of $\Pi$, carrying
out the fixed point computation, and then updating the value of its partition.

Asynchronous execution of workers may lead to race conditions while reading and writing elements of $\Pi$. For
example, a worker $i$ in iteration $a + 1$ may read a stale value of partition $\pi_j$ if worker $j$ has not
completed its write for the previous iteration $a$. Similarly, a worker node $i$ in iteration $a$ may update
partition $\pi_i$ before another worker $j$ has had a chance to read the state of $\pi_i$ for iteration $a$. Such
race conditions indicate that the read and write steps of different workers must be synchronized. In general, this
is accomplished using process synchronization, which is commonly referred to as bulk synchronization (barrier
constraints). Algorithm 2a depicts a parallel ML computation that relies on bulk synchronization primitives such as
barriers.


(a) Parallel Algorithm with BSP

procedure Worker $i$
    READ BARRIER
    $r_i[\pi_1], \ldots, r_i[\pi_p]$
    fixed point computation using data set $D$
    WRITE BARRIER
    $w_i[\pi_i]$

(b) Parallel Algorithm with relaxed constraints

procedure Worker $i$
    READ CONSTRAINT
    $r_i[\pi_1], \ldots, r_i[\pi_p]$
    fixed point computation using data set $D$
    WRITE CONSTRAINT
    $w_i[\pi_i]$

Alg. 2: (a) worker computation using read and write barriers and (b) the same computation with the relaxed read and
write constraints.


We now express the barrier constraints in the notation of the database access model introduced in Definition 3,
using logical predicates. In particular, before the read step at worker $i$ in iteration $a + 1$, the barrier must
ensure that the writes of every worker for iteration $a$ have completed. This can be stated as:

Read Barrier: $\forall i, j, k:\ w_k[\pi_k][a] < r_i[\pi_j][a+1]$

Similarly, the write barrier synchronization stipulates that the reads of all workers in iteration $a$ have completed
before the worker can write its partition in iteration $a$. This can be stated as:

Write Barrier: $\forall i, j, k:\ r_k[\pi_j][a] < w_i[\pi_i][a]$

Having formally defined the read and write barriers (and thus BSP), we show that Algorithm 2a is equivalent to a
sequential ML computation as defined in Definition 4 (proof in supplementary section).

Theorem 1 An execution resulting from bulk synchronization (Algorithm 2a) is a sequential ML computation.

This bulk synchronization of all the workers in each iteration becomes a major source of inefficiency, since the
results of different worker nodes or threads are not guaranteed to arrive at nearly the same time. At the read
barrier, all the workers are blocked until the updates from all the workers are completed for the previous iteration.
Similarly, at the write barrier, the updates of all the workers are blocked until the reads of all the workers are
completed at each partition for the current iteration.
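The two barriers of Algorithm 2a can be sketched with Python's standard `threading.Barrier`. This is a toy illustration under stated assumptions: one scalar parameter per worker, a fixed number of iterations instead of a convergence test, and a hypothetical `update` function standing in for the fixed point computation. It is not the implementation used in the experiments.

```python
import threading


def run_sequential(num_workers, iterations, update):
    """Reference: all reads of an iteration precede all of its writes."""
    params = [float(i) for i in range(num_workers)]
    for _ in range(iterations):
        snapshot = list(params)
        params = [update(i, snapshot) for i in range(num_workers)]
    return params


def run_bsp(num_workers, iterations, update):
    """Each worker owns params[i]; two barriers per iteration as in Algorithm 2a."""
    params = [float(i) for i in range(num_workers)]
    barrier = threading.Barrier(num_workers)

    def worker(i):
        for _ in range(iterations):
            barrier.wait()            # read barrier: previous writes are done
            snapshot = list(params)   # r_i[pi_1], ..., r_i[pi_p]
            new_value = update(i, snapshot)
            barrier.wait()            # write barrier: all reads are done
            params[i] = new_value     # w_i[pi_i]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return params


def mix(i, snapshot):
    """Hypothetical update: move parameter i halfway toward the mean."""
    return 0.5 * snapshot[i] + 0.5 * sum(snapshot) / len(snapshot)


bsp_result = run_bsp(4, 5, mix)
seq_result = run_sequential(4, 5, mix)
```

Both runs produce the same parameter values, in line with Theorem 1. Every `barrier.wait()` call is a point where fast workers idle until the slowest one arrives, which is the inefficiency described above.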

4.4 Data-centric Synchronization of Parallel ML Execution 

Although the barrier constraints ensure the correctness of ML computations, they impose (unnecessarily) stringent
synchronization constraints. A key observation is that when worker $i$ is reading $\pi_j$, it only needs to be
synchronized with respect to the concurrent writes of $\pi_j$. This is exactly how database accesses from multiple
transactions are synchronized. The read synchronization at worker $i$ in iteration $a + 1$ can thus be stated as:

Read Constraint: $\forall i, j:\ w_j[\pi_j][a] < r_i[\pi_j][a+1]$

Similarly, the write synchronization at worker $i$ in iteration $a$ can be stated as:

Write Constraint: $\forall i, j:\ r_j[\pi_i][a] < w_i[\pi_i][a]$

Theorem 2 An execution where read and write constraints are enforced (Algorithm 2b) has the same behavior as
a sequential ML computation.

Theorem 3 Executions resulting from bulk synchronization are subsumed in the executions resulting from
enforcement of read and write constraints.

Proofs for Theorems 1, 2 and 3 are given in Appendix 1. A direct consequence of Theorem 3 is that there are fewer
constraints (more possible executions), leading to greater concurrency without compromising correctness
(Theorem 2).
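The gap between the two sets of constraints can be made concrete by checking recorded executions against each. The sketch below is an illustration under assumptions: an execution is a list of `(kind, worker, partition, iteration)` tuples, iterations start at 1, and partition `j` is owned by worker `j`. The function names are hypothetical.

```python
def satisfies_data_centric(trace, workers):
    """Check the read and write constraints on a recorded execution."""
    pos = {op: k for k, op in enumerate(trace)}
    missing = len(trace)  # position assigned to an operation that never occurs
    for kind, _worker, part, it in trace:
        here = pos[(kind, _worker, part, it)]
        if kind == "r" and it > 1:
            # Read constraint: the owner's write of iteration it-1 comes first.
            if pos.get(("w", part, part, it - 1), missing) > here:
                return False
        if kind == "w":
            # Write constraint: every worker has read this partition in iteration it.
            if any(pos.get(("r", j, part, it), missing) > here for j in workers):
                return False
    return True


def satisfies_barriers(trace, workers):
    """Check the read and write barriers (BSP) on a recorded execution."""
    pos = {op: k for k, op in enumerate(trace)}
    missing = len(trace)
    for kind, _worker, part, it in trace:
        here = pos[(kind, _worker, part, it)]
        if kind == "r" and it > 1:
            if any(pos.get(("w", k, k, it - 1), missing) > here for k in workers):
                return False
        if kind == "w":
            if any(pos.get(("r", k, j, it), missing) > here
                   for k in workers for j in workers):
                return False
    return True


# Worker 1 writes pi_1 and starts iteration 2 before worker 2 has read pi_2.
RELAXED = [("r", 1, 1, 1), ("r", 1, 2, 1), ("r", 2, 1, 1), ("w", 1, 1, 1),
           ("r", 1, 1, 2), ("r", 2, 2, 1), ("w", 2, 2, 1), ("r", 1, 2, 2),
           ("r", 2, 1, 2), ("r", 2, 2, 2), ("w", 1, 1, 2), ("w", 2, 2, 2)]
BSP = [("r", 1, 1, 1), ("r", 1, 2, 1), ("r", 2, 1, 1), ("r", 2, 2, 1),
       ("w", 1, 1, 1), ("w", 2, 2, 1), ("r", 1, 1, 2), ("r", 1, 2, 2),
       ("r", 2, 1, 2), ("r", 2, 2, 2), ("w", 1, 1, 2), ("w", 2, 2, 2)]
```

The `BSP` trace passes both checks, while `RELAXED` passes only the data-centric check. This is the kind of additional execution that Theorem 3 admits.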

5 A Data-centric Synchronization Protocol 

We present a simple protocol for data-centric synchronization to ensure correctness of ML computations. Worker 
processes run independently of each other and can be either on the same or different machines. A central server 
process is responsible for communicating the convergence of the algorithm to the workers so that they can stop 
their work. Workers send their read and write requests to the server and the computation is performed locally over 
the full training data set. The server is responsible for executing read and write operations on the data objects while 
ensuring that the read and write constraints are enforced on the parameters. 

The Write Protocol. A write operation issued from a worker $i$, which is in its $a$-th iteration, on parameter
partition chunk $\pi_i$ can be executed if this chunk has been read by all the worker processes in their $a$-th
iterations. This can be ensured in a very efficient way by associating a bit vector (of size equal to the number of
workers) with each parameter chunk. When a chunk is updated, all bits in this vector are set to zero. When a read
operation issued by a worker $k$, while in its $a$-th iteration, is executed on this chunk, the bit corresponding to
this worker is set. The scheduler can execute the above-mentioned write operation by quickly checking whether all
bits in the bit vector are one. Otherwise, the write operation is deferred for later consideration.

The Read Protocol. A read operation issued from a worker $i$, which is in its $(a+1)$-th iteration, on a parameter
chunk $\pi_j$ can be executed if a write operation issued from worker $j$, while in its $a$-th iteration, has already
been executed on this chunk. Again, this can be ensured in a very efficient and simple way by associating an
iteration number with each chunk. Every time a write operation is executed on this chunk, its iteration
number is set to the iteration number of the write operation. The above-mentioned read operation can be executed
on this chunk if the iteration number in the read operation is one more than the iteration number of the chunk.
Otherwise, the read operation is deferred for later consideration.

We note that the protocol outlined above is very different from the lock-based synchronization that is used in
general-purpose database systems. The main reason is that data-centric synchronization for iterative computation
requires different approaches than traditional methods for enforcing serializability 0.
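The two admission tests can be sketched as server-side state attached to one chunk. This is a minimal single-threaded illustration: the class name `Chunk`, the method names, and the convention that a deferred request simply returns a failure flag are assumptions made for the example, not details taken from the paper.

```python
class Chunk:
    """Server-side state for one parameter chunk (partition)."""

    def __init__(self, value, num_workers):
        self.value = value
        self.iteration = 0                      # iteration of the last executed write
        self.read_bits = [False] * num_workers  # who has read the current version

    def try_read(self, worker, iteration):
        """Read protocol: the read's iteration must be one more than the chunk's."""
        if iteration != self.iteration + 1:
            return False, None                  # deferred
        self.read_bits[worker] = True
        return True, self.value

    def try_write(self, iteration, new_value):
        """Write protocol: execute only once every worker's bit is set."""
        if not all(self.read_bits):
            return False                        # deferred
        self.value = new_value
        self.iteration = iteration
        self.read_bits = [False] * len(self.read_bits)
        return True


chunk = Chunk(value=0.0, num_workers=2)
log = [
    chunk.try_read(worker=0, iteration=1),       # owner reads in iteration 1
    chunk.try_write(iteration=1, new_value=1.5), # deferred: worker 1 has not read
    chunk.try_read(worker=1, iteration=2),       # deferred: write 1 not executed
    chunk.try_read(worker=1, iteration=1),       # worker 1 reads in iteration 1
    chunk.try_write(iteration=1, new_value=1.5), # now admitted
    chunk.try_read(worker=1, iteration=2),       # sees the value of iteration 1
]
```

A real server would queue deferred requests and retry them when the chunk state changes; the sketch leaves that scheduling out.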

6 Experiments 

In order to test our approach for practical applicability, we conducted an exhaustive evaluation (Figure 2) using
both synthetic and real-world datasets. All experiments were performed on a machine with dual Intel(R) Xeon(R)
CPU E5-2697 v2 @ 2.70GHz CPUs. Each of these CPUs has 24 hardware threads (48 threads total). The machine has
256 GB of memory. We next describe the various experiments and discuss the results. The results were obtained by
running each experiment 10 times and taking the trimmed mean (the average after dropping the 2 fastest and 2
slowest runs). While in practice distributed ML solutions will be deployed to handle large data, the impact of
data-centric synchronization can be observed even on relatively small data sets.

6.1 Scaling with Number of Workers 

For this experiment, we generated a synthetic dataset with 960 numerical features and one dependent variable. A
linear regression model was trained over 5000 examples (gradient descent iterations until convergence). The number
of workers was varied from 6 to 40. Figure 2a shows the percentage improvement (from 20% to almost 55%) in
the running time of the training phase. As the number of workers increases, data-centric synchronization has more
opportunity for improvement over process-centric synchronization, because under the latter each iteration must wait
for more workers to finish their read and write operations. Figure 2b illustrates the speedup under the two
approaches: under BSP the speedup is relatively flat, whereas data-centric synchronization achieves significantly
better speedup. We have also included a curve for the theoretical limit of a completely asynchronous speedup based
on Amdahl’s law, using 0.01 for the fraction of the computation that is necessarily serial due to memory contention
and related issues.
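For reference, the Amdahl limit with serial fraction $s$ and $n$ workers is $1 / (s + (1 - s)/n)$. The short sketch below evaluates it for $s = 0.01$ at the worker counts shown on the axis of Figure 2a; the function name is illustrative.

```python
def amdahl_speedup(num_workers, serial_fraction=0.01):
    """Upper bound on speedup when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_workers)


# Worker counts taken from the axis of the scaling experiment.
worker_counts = [6, 8, 12, 16, 24, 32, 40]
limits = {n: amdahl_speedup(n) for n in worker_counts}
```

Even with only 1% serial work, the bound at 40 workers is noticeably below 40, so the theoretical curve itself bends away from linear speedup.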


[Figure 2: six plots, not reproduced here.
(a) Scaling with number of workers (Batch GD). x-axis: Number of workers (6 to 40); y-axis: Percentage Improvement.
    Training Data Size = 5000, Convergence Tolerance = 0.00001, Number of features = 960.
(b) Speedup (Batch GD). Training Data Size = 5000, Number of features = 960.
(c) Scaling with number of features (Batch GD). x-axis: Number of Features (240 to 15360); y-axis: Percentage
    Improvement. Training Data Size = 500, Fixed Iterations.
(d) Performance with SGD iterations on real dataset. x-axis: Number of SGD iterations.
(e) Performance with SGD iterations on real dataset with varying parallelism. x-axis: Number of Workers.
(f) Performance with SGD and mini-batch iterations on real dataset.]

Figure 2: Experimental Results


6.2 Scaling with Number of Features 

For this experiment, we generated several synthetic datasets, each with a different number of features. Linear
regression models were trained on 500 examples for each of the datasets (constant number of gradient descent
iterations). All the experiments were run with the number of workers set to 16, 24 and 40. Figure 2c shows the
scalability of our approach for a large number of features. With fewer features, the trend of improvement obtained
with different numbers of workers is not clear, but there is a clear trend as the number of features is increased
(with a larger number of workers, we get more improvement). In particular, with 16 workers, the percentage
improvement declines significantly from the high of 75% to 25% as the number of features is increased. However,
when the amount of parallelism is increased by deploying more workers, the overall improvement with a large number
of features is around 50%, indicating that data-centric synchronization does result in a higher level of
parallelism and concurrency in the system.

6.3 Experiments with real world dataset 

In Figures 2d, 2e and 2f we compare our approach with BSP on a real world dataset [8] which has 150,360
features and 16,087 training examples. Figure 2d reports the performance improvement and absolute times of
running stochastic gradient descent (SGD) with a varying number of iterations with 6 workers. On the real dataset,
our approach results in significant improvements, with the percentage improvement ranging from 65% to almost
75%. Figure 2e reports the performance improvements for different numbers of iterations when varying the number
of workers. In this experiment, we get a consistent trend that the percentage improvement declines from the high
of 70-75% to 40-50% (which is still a significant improvement). The explanation of this decline is that under SGD
the amount of work done per iteration is much smaller (1 training sample versus the entire dataset). Furthermore,
as the number of workers is increased from 6 to 40, the work per assignment is further reduced due to feature
partitioning. This results in synchronization overhead becoming dominant over useful computation. By analyzing
the raw data, we discovered that the synchronization overhead of BSP grows more slowly than that of our technique.
The explanation is that as the amount of useful work being done by each worker becomes smaller, the window of delay
between the slowest and the fastest worker finishing their respective assignments becomes increasingly smaller. On
the other hand, under our scheme the cost of the synchronization check increases with a larger number of workers
while the amount of useful work per unit continues to decrease. This explains the overall decline in percentage
improvement with increased parallelism. In order to validate this hypothesis, in Figure 2f we compared the relative
performance of the two protocols when computing gradient descent with a mini-batch, where the batch size was fixed
to 100 training examples. In relative terms, the decline in performance improvement is indeed much more pronounced
in SGD, whereas it is not as sharp under mini-batch. Detailed analysis of the raw data indicates that in the case
of mini-batch, both approaches benefit from increased parallelism, but beyond a certain point increased parallelism
is not beneficial. We attribute this to the small amount of useful computation per iteration.
This performance degradation can be addressed by exploiting data partitioning and data sparsity.

7 Data-centric Synchronization with Admissible Delay 

Function-based synchronization exploits the semantics of the underlying function being optimized and leverages it
to increase asynchrony (equivalently, to reduce synchronization and permit more concurrency and interleavings
among workers) in the parallel ML computation. This asynchrony can be captured by allowing read operations to
read stale writes. Similarly, a write operation in an iteration does not have to be blocked until the reads from all
the workers in that iteration have been processed. Instead, the write of a worker can be processed as long as all
the reads are within some distance of the write (where distance is specified in terms of the iteration number). The
protocol has a notion of delay $\delta \geq 0$, which stipulates that the workers (i.e., their read and write
operations) stay within $\delta$ iterations of each other.

The weaker form of read synchronization at worker $i$ in iteration $\alpha$ can thus be stated as:

Asynchronous Read Constraint: $\forall i,j:\ w_j[\pi_j][\alpha - 1 - \delta] < r_i[\pi_j][\alpha]$

Similarly, the weaker form of write synchronization at worker $i$ in iteration $\alpha$ can be stated as:

Asynchronous Write Constraint: $\forall i,j:\ r_j[\pi_i][\alpha - \delta] < w_i[\pi_i][\alpha]$

We note that when $\delta = 0$, the above constraints yield executions that ensure sequential semantics. On the other
hand, if $\delta = \infty$, the parallel execution is completely asynchronous and reduces to the parallel executions resulting from
a system such as Hogwild! hd. 
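
As a worked instance of the two asynchronous constraints, the block below writes them out for two values of the delay. It is only a sketch in the notation of this section ($\alpha$ is the iteration number, $\delta$ the delay); the value $\delta = 1$ is chosen purely for illustration. With $\delta = 0$ the constraints take the same form as the read and write constraints used in the proofs of Section 9, while with $\delta = 1$ a read in iteration $\alpha$ only has to wait for the write of iteration $\alpha - 2$.

```latex
\begin{align*}
\delta = 0:\quad & w_j[\pi_j][\alpha-1] < r_i[\pi_j][\alpha],
                 & r_j[\pi_i][\alpha] < w_i[\pi_i][\alpha] \\
\delta = 1:\quad & w_j[\pi_j][\alpha-2] < r_i[\pi_j][\alpha],
                 & r_j[\pi_i][\alpha-1] < w_i[\pi_i][\alpha]
\end{align*}
```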

7.1 Revised Protocol 

The Write Protocol. A write operation issued by worker $i$, which is in its $\alpha$-th iteration, on parameter
partition chunk $\pi_i$ can be executed if the slowest worker to read this chunk is no more than $\delta$ iterations
behind worker $i$. To ensure this, an array (of size equal to the number of workers) can be associated with each
parameter chunk. When a read operation issued by worker $k$, while in its $\beta$-th iteration, is executed on this
chunk, the element corresponding to this worker is set to $\beta$. The scheduler can execute the above-mentioned
write operation if the minimum number in this array is greater than or equal to $\alpha - \delta$. Otherwise, the
write operation is deferred for later consideration.
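
The write-side bookkeeping described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: the class name `ChunkReadTracker` and its methods are hypothetical, workers are indexed from 0, and iterations are assumed to start at 1, so that an entry of 0 means "no read yet".

```python
from typing import List


class ChunkReadTracker:
    """Per-chunk array used by the write protocol (illustrative names)."""

    def __init__(self, num_workers: int) -> None:
        # last_read[k] holds the iteration in which worker k last read this chunk.
        # Assumption: iterations start at 1, so 0 means worker k has not read yet.
        self.last_read: List[int] = [0] * num_workers

    def record_read(self, worker: int, iteration: int) -> None:
        # Called when a read issued by `worker` in `iteration` is executed.
        self.last_read[worker] = iteration

    def can_write(self, iteration: int, delta: int) -> bool:
        # The write issued in `iteration` may run only if the slowest reader
        # is at most `delta` iterations behind.
        return min(self.last_read) >= iteration - delta


tracker = ChunkReadTracker(num_workers=3)
tracker.record_read(worker=0, iteration=1)
tracker.record_read(worker=1, iteration=1)
blocked = tracker.can_write(iteration=1, delta=0)   # worker 2 has not read yet
tracker.record_read(worker=2, iteration=1)
allowed = tracker.can_write(iteration=1, delta=0)   # all workers have read
```

A scheduler would keep one such tracker per chunk and re-check deferred writes whenever a read on that chunk is executed.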

The Read Protocol. A read operation issued by worker $i$, which is in its $\alpha$-th iteration, on a parameter
chunk $\pi_j$ can be executed if the last write operation executed on this chunk was issued by worker $j$ while in
its $(\alpha - \delta - 1)$-th iteration or later. To ensure this, an iteration number can be associated with each
chunk. Every time a write operation is executed on this chunk, its iteration number is set to the iteration number
during which this write operation was issued. The above-mentioned read operation can be executed on this chunk
if the iteration number of the chunk is greater than or equal to $\alpha - \delta - 1$. Otherwise, the read operation
is deferred for later consideration.
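
The read-side check can be sketched in the same spirit. Again this is only an illustration under stated assumptions: `ChunkVersion` and `schedule_reads` are hypothetical names, iterations are assumed to start at 1, and a chunk that has never been written carries iteration number 0.

```python
from typing import List, Tuple


class ChunkVersion:
    """Read-side bookkeeping for one parameter chunk (illustrative names)."""

    def __init__(self) -> None:
        # Iteration in which the last executed write was issued.
        # Assumption: iterations start at 1, so 0 means "initial values only".
        self.last_write_iteration = 0

    def record_write(self, iteration: int) -> None:
        self.last_write_iteration = iteration

    def can_read(self, iteration: int, delta: int) -> bool:
        # A read issued in `iteration` needs the write of iteration
        # (iteration - delta - 1) or a later one to have been executed.
        return self.last_write_iteration >= iteration - delta - 1


def schedule_reads(
    chunk: ChunkVersion, pending: List[Tuple[int, int]], delta: int
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Split pending (worker, iteration) reads into executed and deferred."""
    executed, deferred = [], []
    for worker, iteration in pending:
        target = executed if chunk.can_read(iteration, delta) else deferred
        target.append((worker, iteration))
    return executed, deferred


chunk = ChunkVersion()
chunk.record_write(1)
strict = schedule_reads(chunk, [(0, 2), (1, 3)], delta=0)
relaxed = schedule_reads(chunk, [(0, 2), (1, 3)], delta=1)
```

With `delta=0` the read issued in iteration 3 is deferred until the write of iteration 2 arrives; with `delta=1` it may proceed on the stale value.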

8 Conclusion 

In this paper we have presented a new data-centric synchronization paradigm for carrying out machine learning
tasks in a distributed environment. Our approach abstracts the iterative nature of ML algorithms and introduces
specific read and write constraints (RC and WC) whose enforcement guarantees sequential correctness while
providing an opportunity to speed up ML computation. We also show that the bulk synchronization process (BSP)
design pattern, which is extensively used in distributed ML tasks, implies RC and WC. Our proposal complements
function synchronization techniques in distributed ML research, which use “bounded staleness” to relax BSP and
increase throughput.

9 Proofs


Theorem 4 An execution resulting from bulk synchronization (Algorithm 1a) is a sequential ML computation.

Proof. Let BSP be an execution using partition set $\Pi$ involving $p$ partitions and $p$ workers, as shown in
Algorithm 1a. In order to show that BSP is sequential, we need to establish:

• No operations of two different iterations are interleaved. 

• Within each iteration, all reads precede any write.

• Iterations are executed consecutively. 

Condition 1 follows from the read barrier which enforces that writes of the previous iteration are completed before 
reads from the next iteration can begin. 

Condition 2 is a consequence of the write barrier. 

Condition 3 is a consequence of the fact that

$$\forall k:\ w_k[\pi_k][\alpha] < w_k[\pi_k][\alpha + 1]$$

(combining read and write barrier). □

Theorem 5 An execution where read and write constraints are enforced (Algorithm 1b) has the same behavior as
a sequential ML computation.

Proof: We need to ensure that the following conditions are satisfied for every partition:

• No operation on a partition in an iteration interleaves with the operations on the same partition in another 
iteration. 

• Within an iteration, all read operations on a partition precede any write operation on the same chunk. 

• An operation on a partition in iteration $\alpha + 1$ cannot appear until all operations on the same partition in
iteration $\alpha$ have completed.

To prove condition 1, let us assume that an execution contains a fragment
$op_1[\pi_i][\alpha]\; op_2[\pi_j][\alpha + 1]\; op_3[\pi_k][\alpha]$, where each $op$ is either a read or a write
operation. We have to show that $j \neq k$. We prove this by contradiction, starting with the assumption $j = k$.
The following cases are possible:

i) $op_2 = R$ and $op_3 = R$

This means that for some $x$ and $y$

$$r_x[\pi_j][\alpha + 1] < r_y[\pi_j][\alpha] \tag{1}$$

However, from the read constraint it follows that

$$w_j[\pi_j][\alpha] < r_x[\pi_j][\alpha + 1] \tag{2}$$

From (1) and (2),

$$w_j[\pi_j][\alpha] < r_y[\pi_j][\alpha]$$

This is a violation of our write constraint.

ii) $op_2 = R$ and $op_3 = W$

This means that for some $x$

$$r_x[\pi_j][\alpha + 1] < w_j[\pi_j][\alpha]$$

This case is a direct violation of the read constraint.

iii) $op_2 = W$ and $op_3 = R$

This means that for some $x$

$$w_j[\pi_j][\alpha + 1] < r_x[\pi_j][\alpha] \tag{3}$$

However, from the write constraint it follows that

$$r_j[\pi_j][\alpha + 1] < w_j[\pi_j][\alpha + 1] \tag{4}$$

and from the read constraint it follows that

$$w_j[\pi_j][\alpha] < r_j[\pi_j][\alpha + 1] \tag{5}$$

From (3), (4) and (5) it follows that $w_j[\pi_j][\alpha] < r_x[\pi_j][\alpha]$, which violates the write constraint.

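iv) $op_2 = W$ and $op_3 = W$

This means that

$$w_j[\pi_j][\alpha + 1] < w_j[\pi_j][\alpha] \tag{6}$$

However, from the write constraint it follows that

$$\forall i:\ r_i[\pi_j][\alpha + 1] < w_j[\pi_j][\alpha + 1] \tag{7}$$

and from the read constraint it follows that

$$\forall i:\ w_j[\pi_j][\alpha] < r_i[\pi_j][\alpha + 1] \tag{8}$$

From (7) and (8), $w_j[\pi_j][\alpha] < r_i[\pi_j][\alpha + 1] < w_j[\pi_j][\alpha + 1]$, which contradicts (6).
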
Condition 2 directly follows from the write constraint. 

Condition 3 follows from the read and write constraints. The read constraint stipulates that any read on a chunk in
iteration $\alpha + 1$ happens only after the write on this chunk in iteration $\alpha$ has finished. The write
constraint stipulates that the write on this chunk in iteration $\alpha$ happens only after all reads on this chunk in
iteration $\alpha$ have finished. Thus, all operations in $\alpha$ are finished before any operation in $\alpha + 1$
begins.
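
The argument for Condition 3 can be written as a single chain on one partition. The block below is only a restatement of the two constraints in the notation of this proof ($\alpha$ is the iteration number; $i$ and $i'$ range over all workers), not an additional result.

```latex
\[
\forall i, i':\quad
r_i[\pi_j][\alpha]
\;<\; w_j[\pi_j][\alpha]
\;<\; r_{i'}[\pi_j][\alpha + 1]
\;<\; w_j[\pi_j][\alpha + 1]
\]
```

The first and third inequalities are instances of the write constraint, and the middle one is the read constraint.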

Theorem 6 Executions resulting from bulk synchronization are subsumed in the executions resulting from
enforcement of read and write constraints.

Proof: It is clear from the proof of Theorem 5 that the relaxed constraints are special cases of barrier conditions
applied at the per-partition level.

9.1 Examples of Executions 

In Figure 3 we show three possible executions on a model database. Consider a model database with two partitions
$\{\pi_1, \pi_2\}$ and assume that the ML computation consists of two iterations. Execution $H_1$ is the one that
results from BSP, and by Theorem 4 it produces correct results. Execution $H_2$ is one of the several additional
executions made possible by relaxing the barrier conditions in our model. In Theorem 5 we showed that these
executions also give correct results. However, $H_3$ is an example of an execution that is permitted neither by BSP
nor by RC and WC; such executions lead to incorrect results. Admitting more executions than BSP leads to
increased concurrency and hence improved performance.


Hi = r 1 [ 77 ^ [l]ri [ 77 2 ] [l]r 2 [t 7 i] [l]r 2 [tt 2 ] [ 1 ]wi [ 77 ^ [l]iu 2 [tt 2 ] [l]n [ 771 ] [2 )n [7 r 2 ] [2]r 2 [ 771 ] [2]r 2 [tt 2 ] [2]wi [tti] [2}w 2 [tt 2 ] [2] 
H 2 = ?’! [ 771 ] [l]n [ 7 r 2 ] [l]r 2 [ 771 ] [l]r 2 [7r 2 ] [1 ]w 2 [7t 2 ] [l]n [ 7 r 2 ] [2]wi [ 771 ] [l]n [7n] [2]r 2 [ 771 ] [2] r 2 [7r 2 ] [2}wi [7Ti] [2 ]w 2 [ir 2 \ [2] 
H 3 = 7’1 [77!] [l]ri [772] [l]wi [ 771 ] [l]r 2 [77i] [l]r 2 [ 772 ] [1 \w 2 [77 2 ] [l]n [ 771 ] [2]n [tt 2 ] [2}wi [tti] [2]r 2 [77 X ] [2]r 2 [7 r 2 ] [2 ]w 2 [n 2 \ [2] 


Figure 3: Examples of execution histories of two-worker computations with two iterations over $\{\pi_1, \pi_2\}$.
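
The three histories of Figure 3 can also be checked mechanically. The Python sketch below is a hypothetical checker, not part of the original system. It encodes each operation as a tuple, assumes that worker $k$ writes only partition $\pi_k$ and that every worker reads every partition in every iteration (as in the figure), and uses the histories as transcribed here. `satisfies_rc_wc` tests the read and write constraints, and `is_bsp` tests whether reads and writes are separated by global barriers.

```python
from typing import List, NamedTuple


class Op(NamedTuple):
    kind: str        # "r" for read, "w" for write
    worker: int
    partition: int
    iteration: int


def r(worker: int, partition: int, iteration: int) -> Op:
    return Op("r", worker, partition, iteration)


def w(worker: int, iteration: int) -> Op:
    # Assumption: worker k writes only its own partition k.
    return Op("w", worker, worker, iteration)


def satisfies_rc_wc(history: List[Op], num_workers: int) -> bool:
    pos = {op: idx for idx, op in enumerate(history)}
    missing = len(history)  # a missing operation counts as "not yet executed"
    for op in history:
        if op.kind == "r" and op.iteration > 1:
            # Read constraint: the previous iteration's write comes first.
            if pos.get(w(op.partition, op.iteration - 1), missing) > pos[op]:
                return False
        if op.kind == "w":
            # Write constraint: all reads of this iteration come first.
            for k in range(1, num_workers + 1):
                if pos.get(r(k, op.partition, op.iteration), missing) > pos[op]:
                    return False
    return True


def is_bsp(history: List[Op]) -> bool:
    # Under BSP, phases (iteration, reads-then-writes) never go backwards.
    phases = [(op.iteration, 0 if op.kind == "r" else 1) for op in history]
    return phases == sorted(phases)


H1 = [r(1, 1, 1), r(1, 2, 1), r(2, 1, 1), r(2, 2, 1), w(1, 1), w(2, 1),
      r(1, 1, 2), r(1, 2, 2), r(2, 1, 2), r(2, 2, 2), w(1, 2), w(2, 2)]
H2 = [r(1, 1, 1), r(1, 2, 1), r(2, 1, 1), r(2, 2, 1), w(2, 1), r(1, 2, 2),
      w(1, 1), r(1, 1, 2), r(2, 1, 2), r(2, 2, 2), w(1, 2), w(2, 2)]
H3 = [r(1, 1, 1), r(1, 2, 1), w(1, 1), r(2, 1, 1), r(2, 2, 1), w(2, 1),
      r(1, 1, 2), r(1, 2, 2), w(1, 2), r(2, 1, 2), r(2, 2, 2), w(2, 2)]
```

Under these assumptions $H_1$ passes both checks, $H_2$ passes only the constraint check, and $H_3$ fails both, because worker 1 writes $\pi_1$ in iteration 1 before worker 2 has read it.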